ABSTRACT OF THE DISSERTATION


A Dataflow Architecture with Improved 


Asymptotic Performance 


by 


Robert Eugene Thomas 
Doctor of Philosophy in Computer Science 
University of California, Irvine, 1981 


Professor Kim P. Gostelow, Chair 


Large scale integration presents a unique opportunity 
to design a computer comprising large numbers of small, 
inexpensive processors. This paper presents a design for 
such a machine based on the asynchronous and functional 
semantics of dataflow. Processors within the machine are 
interconnected by a packet-switched binary n-cube although 
a limited number of other networks may be substituted with 
predictable asymptotic effects on performance. Improved 
performance of the proposed machine over a previously 
reported dataflow architecture is predicted in terms of the 
computational time complexity of several example programs: 
matrix multiply, quicksort, and iterative solutions to 
partial differential equations. Although the example 
programs are numerical in nature, the machine is intended 
for general-purpose computation since programs are written 
in the high level dataflow language Id without knowledge of 
the number of processors or interconnections. New storage 
management and data communication methods are also 
presented which are necessary to obtain the improved 
performance. Experimental results from a simulated machine 
incorporating some of these methods are given to 
corroborate analytic results. 
1.0 INTRODUCTION 


Large scale integration presents an opportunity to 
design a computer comprising hundreds or thousands of 
small, inexpensive processors. This opportunity is 
attractive for several reasons. First, signal propagation 
delay will eventually limit the performance of conventional 
sequential computers. Thus multiple execution units of 
some sort (e.g., arithmetic/logic units, processors) will 
eventually be necessary to increase performance further. 
Second, the current trend of rising software costs relative 
to hardware costs warrants, in many cases, trading 
inexpensive hardware for ease of software production. One 
of our approaches for realizing this tradeoff is to 
transfer the responsibility for processor and memory 
management from the programmer to the machine. We view 
automatic resource management as a potential source of 
additional parallelism which, if suitably exploited, would 
mitigate performance losses normally associated with 
dynamic resource management. A third reason multiprocessor 
computers are attractive is that redundant processors and 
communication links may be used to continue computation in 
the face of certain hardware faults. While error detection 
and control are beyond the scope of this paper, we believe 
the principles of dataflow underlying the architecture 
described here provide new opportunities for supporting 
high performance "fail-soft" computing. 


1.1 Principles of Dataflow 


Dataflow is a model of computation based on asynchrony 
and functionality. Asynchrony means a dataflow operation 
(e.g., a machine instruction) may begin execution any time 
after receiving its input operands. Functionality means 
every dataflow operation consumes a set of input values and 
creates a set of output values without side-effects. 
Asynchrony is the basis of concurrency in dataflow while 
functionality ensures concurrent operations do not 
interfere and therefore need not be artificially sequenced. 
Detailed descriptions of various dataflow computational 
models and their advantages have been presented elsewhere 
[3,12,13,22,48]. 


1.2 Dataflow Architectures and Complexity Analysis 


Many dataflow architectures have been proposed 
[11,12,14,15,22,23,33,42] and a few prototypes have been 
constructed [11,12,23]. Evaluation of these architectures 
is an important task, but so far this evaluation has been 
more art than science. In an effort to improve the 
situation, time complexity analysis was used to design and 
evaluate the architecture presented here. Although the 
O-notation used in the complexity analysis is admittedly a 
rough tool (large constants may be hidden), such analysis 
can quickly determine the presence of bottlenecks in large 
computations. Some readers may also question the 
assumption used in the complexity analyses that an 
unbounded number of processors are available. Of course, 
any realizable machine is limited to a finite number of 
processors, communication lines, memory cells, etc. 
However, the necessarily limited resources of von Neumann 
machines have not lessened the value of complexity analysis 
for single processors. Therefore, in the same way that 
unbounded memory is assumed in conventional complexity 
analysis, unbounded memory and processors will be assumed 
in the parallel complexity analyses presented here. 

An important practical issue in the application of 
dataflow and complexity results to real systems is hardware 
and software cost. All too often studies of parallel 
computing models do not intend that the model be 
implemented (i.e., it is a theoretical model only) or the 
analysis gives little or no aid to the programming of the 
proposed machine. At least one solution to this problem 
has been achieved and tested by simulation in a dataflow 
architecture [22]. There it was shown that general 
programs can be written in a high level language, Irvine 
dataflow (Id) [3]. Only one compilation is then required 
for multiple executions with various sized data. An 
important result is that parallel programs can be written 
without knowledge of the number or the interconnection of 


processors. The methods used to obtain these results are 


extended in this paper to a new dataflow architecture which 
has improved asymptotic performance over that described in 
[22]. 

Section 2 of this paper describes a simple parallel 
computer model used for the complexity analysis of some 
common algorithms. Section 3 describes how the theoretical 
model of Section 2 can be implemented in a dataflow 
environment. Section 4 presents experimental evidence 
(derived from executing real programs on a simulated 
dataflow machine) which lends support to selected results 
from Section 3. 


2.0 AN ANALYTIC, PARALLEL COMPUTER MODEL 


Complexity analysis requires an explicit model of the 
computing device concerned. Examples of models commonly 
used in complexity analyses are the Turing machine and the 
"random access machine" [1]. In this section, a simplified 
model of a parallel architecture is described for the 
purpose of complexity analysis. Implementation of the 
model will be discussed in Section 3. 

Our parallel computer model comprises an unbounded 
number of processing elements (PEs) interconnected by a 
communication network. Intuitively, each PE may be 
considered a conventional processor directly connected to a 
private, unbounded memory. Network communication is 
assumed to operate on a "store and forward" packet basis. 
For purposes of analysis (as opposed to implementation), 
all PEs and communication links are synchronized by a 
central clock. This simplifies analysis by reducing the 
need for methods based on probabilities. Although the 
results thereby achieved apply only indirectly to the 
implementation discussed in Section 3, it is considered 
prudent to use this kind of analysis before more detailed 
analyses are conducted. 

The communication network has considerable impact on 
the cost of implementing the computing device. For example 
the crossbar network, e.g. [45], has cost $O(N^2)$ where N is 
the number of nodes, and thus the communication network 
clearly dominates cost for large N. Although a number of 
networks costing less than $O(N^2)$ have been proposed 
[7,8,18,26,31,32,34,36,37,41,43], relatively little 
comparative information has been published about them. A 
good start in this direction is the work of Siegel [34], 
and Wu and Feng [44]. Wu and Feng discussed the 
equivalence of several networks while Siegel compared a 
small number of quite different networks in terms of the 
computational time complexity of simulating one network 
using another. Such comparisons are significant because 
they allow complexity results derived from one network to 
be applied to other networks. This is one reason why one 
of the networks studied by Siegel (see footnote 1), the 
binary n-cube, has been chosen as the network of our 
parallel computer model. 


2.1 The Binary n-cube 


A binary n-cube is an interconnection of $N = 2^n$ PEs 
placed at the corners of an n-dimensional cube. Each edge 
or link of the cube has two PEs; each PE has n 
bi-directional one-message-at-a-time (i.e., half-duplex) 
links connecting it to n other PEs. Examples of n-cubes 
are given in Figure 2.1. This paper assumes that at any 
given instant each PE can transmit or receive (but not 
both) on any one of the n links connected to it, although 
more concurrent implementations are also possible. The 
interconnections of the n-cube can be expressed formally 
using $p_{n-1} \cdots p_0$ to denote the binary address of an 
arbitrary PE and $\bar{p}_i$ as the complement of $p_i$. The ith 
function defining the n-cube interconnections is given by 

$$\mathrm{cube}_i(p_{n-1} \cdots p_{i+1}\, p_i\, p_{i-1} \cdots p_0) = p_{n-1} \cdots p_{i+1}\, \bar{p}_i\, p_{i-1} \cdots p_0, \qquad 0 \le i < n$$

In the sequel, the notation for a particular communication 
link will be abbreviated from $\mathrm{cube}_i(x)$ to $\mathrm{cube}_i$ when the 
address x is obvious from context. 

Footnote 1. Careful interpretation of Siegel's results is necessary 
since these results were developed for single instruction 
stream-multiple data stream (SIMD) computers [16] whereas 
the model described here is not a SIMD because each PE is 
assumed to have its own instruction stream. An example use 
of Siegel's results appears in Section 5: Conclusions. 
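The interconnection function amounts to complementing one bit of the PE address. The following is a minimal sketch of that reading, assuming PE addresses are held as ordinary non-negative integers; the helper names `cube` and `neighbors` are illustrative and do not come from the text.

```python
def cube(i: int, x: int) -> int:
    """Address reached from PE x over link cube_i: complement bit i of x."""
    return x ^ (1 << i)


def neighbors(x: int, n: int) -> list:
    """The n PEs directly linked to PE x in an n-cube (one per bit position)."""
    return [cube(i, x) for i in range(n)]
```

For example, `neighbors(0, 3)` lists the three PEs adjacent to PE 000 in a 3-cube. Applying `cube(i, ...)` twice returns the original address, which reflects the bi-directional nature of each link.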


2.2 Binary n-cube Properties 


The Hamming distance between two binary numbers (PE 
addresses) is the number of bit positions which differ in 
the two numbers. Let $\mathrm{bit}_j(z)$ denote the jth bit in address 
z. The following routing algorithm may be used to direct a 
message from PE x to PE y. Select any i such that $0 \le i < n$ 
and $\mathrm{bit}_i(x) \ne \mathrm{bit}_i(y)$. If no such i exists then the 
message has arrived; otherwise, transmit the message using 
function $\mathrm{cube}_i(x)$ and repeat with the new address x. This 
has the effect of reducing the Hamming distance by one at 
each step of transmission, and since the largest Hamming 
distance is n = log N, at most log N steps (see footnote 2) 
are needed to transmit a message. This routing algorithm 
also implies that if two PEs are separated by Hamming 
distance m then m! distinct paths exist between the two 
PEs. Another interesting capability of the n-cube is the 
most distant N message transfer described by Sullivan and 
Bashkow [37]. In this transfer, each PE (concurrently with 
all other PEs) sends a single message to that PE at the 
greatest Hamming distance from it. The algorithm (see 
footnote 3) is as follows: 


Algorithm 2.1 [37]. 

    For i from 0 to n-1 do 
        Using links $\mathrm{cube}_i$, all PEs transmit/receive 
        each message with destination address which 
        differs from the message's current address in 
        the ith bit position; 


Footnote 2. All logarithms will be taken to the base two. 

Footnote 3. The notation "For i ... do" implies that every PE uses 
the same value of i at the same time in the order given. 


To see how this works, note each PE initially contains one 
message at Hamming distance n from its destination. Since 
N/2 PEs are directly connected in an n-cube to the other 
N/2 PEs by links $\mathrm{cube}_i$ for each i (i.e., there is a 
distinct partitioning for each i), N/2 messages can be 
exchanged in two time steps for each iteration thus 
bringing all messages one step closer to their 
destinations. Therefore, 2 log N steps (see footnote 4) are 
required to complete the transfer. Given the assumption 
that at any given step a PE may service only one of the n 
links to which it is connected (see footnote 5), this 
algorithm is optimal since: 


1. At each step the Hamming distance of all messages 
   transmitted is decreased by the maximum possible, 
   i.e., by one; 


2. Every step uses the maximum possible concurrency, 
   i.e., N/2 transmissions. 
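The routing rule and Algorithm 2.1 can be sketched as a small simulation. This is an illustrative sketch only: it assumes integer PE addresses, always picks the lowest differing bit when routing (the text allows any differing bit), and models the half-duplex restriction by letting the PEs with bit i equal to 0 transmit in one step and the remaining PEs in the next. The function names are hypothetical.

```python
def route(src: int, dst: int, n: int) -> list:
    """Greedy routing: flip each differing bit in turn, lowest bit first."""
    path, x = [src], src
    for i in range(n):
        if ((x ^ dst) >> i) & 1:
            x ^= 1 << i          # traverse link cube_i
            path.append(x)
    return path


def most_distant_transfer(n: int):
    """Simulate Algorithm 2.1; return (steps used, all messages delivered?)."""
    size = 1 << n
    mask = size - 1
    # held[x] = destination of the single message currently stored at PE x
    held = {x: x ^ mask for x in range(size)}
    steps = 0
    for i in range(n):
        bit = 1 << i
        moved = {}
        # Half-duplex: one half of the PEs sends, then the other half sends.
        for sender_bit in (0, bit):
            for x, dst in held.items():
                if (x & bit) == sender_bit:
                    moved[x ^ bit] = dst   # every message still differs in bit i
            steps += 1
        held = moved
    delivered = all(x == dst for x, dst in held.items())
    return steps, delivered
```

Under these assumptions the simulation uses two steps per bit position, which matches the 2 log N count given above.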


Footnote 4. In [37] only log N steps were required because full-duplex 
links were assumed. 

Footnote 5. This assumption precludes "pipelining" of the transfer 
algorithm as discussed in [37]. If each PE could 
concurrently transmit/receive on all links to which it is 
connected, then another (new) message could be started 
immediately after the first message has departed from its 
source PE since a given link is used only once in the 
execution of the algorithm. This would allow each PE to 
send m distinct messages to the PE most distant from it in 
(m-1)+log N steps. However, with the assumptions used in 
this paper, 2m log N steps are required for this transfer 
since the algorithm must be repeated m times. 


The following four capabilities of the n-cube will be 
used extensively in the sequel. The first is the N-way 
broadcast which distributes a single message from one PE to 
all other PEs in log N steps [37]. Assume the message to 
be broadcast is transmitted to the original broadcasting PE 
on a hypothetical link $\mathrm{cube}_{-1}$. The algorithm is: 


    Each PE that receives a broadcast message on link 
    $\mathrm{cube}_i$ retransmits the message using (in order) 
    links $\mathrm{cube}_j$ for $i < j < n$. 


The first transmission (for j=0) from the original 
broadcasting PE using $\mathrm{cube}_0$ can be thought of as splitting 
the original n-cube into two disjoint, identical cubes of 
size N/2. Two PEs (the original and the receiver on link 
$\mathrm{cube}_0$) now have the message and each becomes the source to 
broadcast the message to the sub-cube in which that PE 
resides. This process is repeated until the resulting 
sub-cubes contain only one PE which terminates the 
broadcast. 
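The broadcast rule can be traced with a few lines of code. The sketch below assumes integer PE addresses and records, for each PE, the step at which the message arrives; the name `broadcast` and the returned dictionary are illustrative choices, not part of the original description.

```python
def broadcast(source: int, n: int) -> dict:
    """Return {PE address: step at which the broadcast message arrives}."""
    received_on = {source: -1}   # the source "received" on hypothetical cube_{-1}
    arrival = {source: 0}
    for j in range(n):           # step j+1 uses links cube_j
        for pe, i in list(received_on.items()):
            if i < j:            # retransmit on every link cube_j with i < j
                target = pe ^ (1 << j)
                received_on[target] = j
                arrival[target] = j + 1
    return arrival
```

In this sketch the number of PEs holding the message doubles at every step, so the last arrival time equals n = log N, and in any one step each PE either sends or receives but never both.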

The second important capability of the n-cube is the 
N(N-1) transfer where each PE transmits N-1 distinct 
messages, one to each of the other N-1 PEs. With this 
transfer, N PEs each send (N-1) messages for a total 
delivery of N(N-1) messages in N log N steps. (Note that 
the N(N-1) transfer is easily adapted to perform matrix 
transpose assuming each PE initially contains exactly one 
row of the matrix.) The algorithm for the N(N-1) transfer 
is the same as the one used in the most distant transfer, 
i.e., Algorithm 2.1. 
Theorem: The N(N-1) transfer can be done in N log N steps. 


Proof by Induction: The basis is trivially true for N = 2. 
Inductive step: Assume the N(N-1) transfer requires 
N log N steps for an n-cube of size N. Let the address of 
an arbitrary PE of an (n+1)-cube of size 2N be $p_n \cdots p_0$. 
Each PE of the (n+1)-cube starts with 2N-1 messages; N of 
these messages will have destinations d with $\mathrm{bit}_0(d) = \bar{p}_0$ 
and N-1 will have destinations with $\mathrm{bit}_0(d) = p_0$ since no 
PE sends a message to itself. Using Algorithm 2.1, for i=0 
each PE thus transmits N messages and receives N messages 
using a total of 2N steps. Exactly one of the messages 
received by each PE must be addressed to itself so each PE 
now contains 2N-2 = 2(N-1) undelivered messages which are 
addressed exactly like the messages of two N(N-1) transfers 
within each of the two sub-cubes defined by the set of 
addresses $p_n \cdots p_1$. Since links $\mathrm{cube}_0$ will not be used 
again, each of these two sub-cubes can act independently. 
By the inductive assumption the two N(N-1) transfers within 
each sub-cube require 2N log N steps. The total is 2N + 
2N log N = 2N(log N + 1) = 2N log 2N steps. [] 


Again, this is optimal under the given assumptions for the 
same reasons as were given in the most distant N message 
transfer discussion. 
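The step count in the theorem can be checked by simulating Algorithm 2.1 on the N(N-1) message set. This is a sketch under stated assumptions: integer PE addresses, one message sent per PE per step, and the half-duplex restriction modeled by letting the PEs with bit i equal to 0 send all their outgoing messages before the other half sends. The function name is hypothetical.

```python
def all_to_all_transfer(n: int):
    """Simulate the N(N-1) transfer; return (steps used, all delivered?)."""
    size = 1 << n
    # held[x] = destinations of the messages currently stored at PE x
    held = {x: [d for d in range(size) if d != x] for x in range(size)}
    steps = 0
    for i in range(n):
        bit = 1 << i
        outgoing = {x: [d for d in msgs if (d ^ x) & bit]
                    for x, msgs in held.items()}
        staying = {x: [d for d in msgs if not ((d ^ x) & bit)]
                   for x, msgs in held.items()}
        for sender_bit in (0, bit):
            senders = [x for x in range(size) if (x & bit) == sender_bit]
            # Each sender emits one message per step over link cube_i.
            steps += max(len(outgoing[x]) for x in senders)
            for x in senders:
                staying[x ^ bit].extend(outgoing[x])
        held = staying
    delivered = all(d == x for x, msgs in held.items() for d in msgs)
    return steps, delivered
```

For a 3-cube (N = 8) this sketch reports 24 steps, which agrees with N log N.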


The third important capability of the n-cube is the 
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N-l1 linear transfer where a single PE sends a distinct 
message to each of the other N-l PEs in O(N) steps. 
Although an algorithm seems to exist® for doing this 
transfer in exactly N-1l steps, the proof is nontrivial and 
its description is not needed for the purpose of this 
paper. The O(N) algorithm is simply transmit messages (in 
any order) at every other time step. This algorithm 


requires 2(N-l1) steps for transmission from the source PE 
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plus at most log N' steps for the last message to arrive. 


since no conflicts are possible. Thus 0O(2(N-1)+log N) = 
O(N) steps are required. 

The fourth important capability of an n-cube is the 
ease and flexibility with which partitions may be defined. 
Some of these partitions are given in the following 
definitions. A $k_{n,m}$-partition is a set of m, $0 \le m \le n$, 
distinct integers j, $0 \le j < n$, specifying the partitioning of 
an n-cube into k disjoint m-cubes. Each of these m 
integers represents a distinct bit position in the n-cube 
PE address $p_{n-1} \cdots p_0$. 


Theorem 2.1. Let $k = 2^{n-m}$. Then there are $n!/(m!(n-m)!)$ 
distinct $k_{n,m}$-partitions of an n-cube. (See Figure 2.2a 
for example.) 


Footnote 6. Send messages in order of decreasing destination Hamming 
distance and select links so that no one link is used twice 
in succession. 


Proof: Consider for the moment that the n-m bits not in 
the partition are fixed to some arbitrary value. Then the 
m bit positions in the partition define a set of $2^m$ 
distinct PE addresses which can be re-labeled to integers 
j, $0 \le j < 2^m$, by ignoring the other n-m bit positions. These 
re-labeled PEs and their connections satisfy the cube 
interconnection functions and thus define an m-cube. There 
are $k = 2^{n-m}$ such m-cubes since the n-m fixed bits may assume 
$2^{n-m}$ different values. All m-cubes are disjoint because 
their original addresses are distinct and because an m-cube 
link must connect two PEs within the same m-cube. Finally, 
there are $n!/(m!(n-m)!)$ distinct combinations of n bit 
positions taken m at a time each of which defines a 
distinct partition of an n-cube. [] 


The complement of a $k_{n,m}$-partition is the set 
$\{i \mid 0 \le i < n\} - (k_{n,m}\text{-partition})$, i.e., all bit positions not in the 
$k_{n,m}$-partition. 


Corollary 2.1a. Each (n-m)-cube specified by the 
complement of a $k_{n,m}$-partition shares exactly one PE with 
each m-cube specified by the $k_{n,m}$-partition (Figure 2.2b). 


Proof: Follows immediately from the proof of Theorem 2.1 
by reversing the bit positions which are fixed with those 
that are variable. [] 
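Theorem 2.1 and Corollary 2.1a can be checked by brute force on a small cube. The sketch below assumes integer PE addresses and represents a partition as a tuple of bit positions; the helper names `m_cubes`, `partitions`, and `complement` are illustrative only.

```python
from itertools import combinations
from math import comb


def m_cubes(n: int, positions) -> list:
    """Group the 2**n PE addresses into m-cubes: two PEs share an m-cube
    when they agree on every bit position NOT listed in `positions`."""
    free = sum(1 << j for j in positions)
    groups = {}
    for x in range(1 << n):
        groups.setdefault(x & ~free, []).append(x)
    return list(groups.values())


def partitions(n: int, m: int) -> list:
    """All partitions of an n-cube into m-cubes, as tuples of bit positions."""
    return list(combinations(range(n), m))


def complement(n: int, positions) -> tuple:
    """Bit positions of the n-cube address that are not in `positions`."""
    return tuple(j for j in range(n) if j not in positions)
```

For n = 4 and m = 2, `partitions(4, 2)` yields six partitions, equal to `comb(4, 2)`, and each one splits the 16 PEs into four disjoint 2-cubes. Intersecting any cube of a partition with any cube of its complement leaves exactly one PE, as the corollary states.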
2.3 Example Complexity Analyses 


This subsection presents the complexity analysis of 
three numerical algorithms. The methods and results are 
intended to demonstrate the n-cube's capability for 
concurrent communication and to serve as a basis for 
generalizing the methods to non-numerical algorithms. 
Besides the already mentioned assumptions, the analyses 
assume the "uniform cost criterion" [1]. This means that 
primitive machine operations such as +, *, etc. are assumed 
to take constant time regardless of the size of the 
operands. 


2.3.1 Related Work and Data Structure Assumptions - 

The application of parallel processors to numerical 
problems has been studied for some time. For example, 
Squire and Palais give a program (without analysis) for 
matrix inversion on a proposed parallel machine 
incorporating a circuit-switched binary n-cube [35]. Many 
studies have been done for Illiac IV-like interconnections 
e.g., [19,25,36,39]. The advent of VLSI has further 
encouraged work in the area of "computation grids" [28]. 
The usual assumption made in these studies is that a PE 
works with a constant, usually small, number of data 
elements. One reason for this is the speed of the results 
obtained; for example, an O(log N) algorithm has been 
shown for NxN matrix transpose on the perfect shuffle 
network [36]. 

The difference between the current approach and the 
others cited is that here the PEs work with complete rows 
of data instead of single elements. Although this approach 
may result in an increase in time complexity (e.g., O(N) to 
O(N log N) for matrix multiply), aggregates of data larger 
than a single element are required for the implementation 
proposed in Section 3 which is intended to avoid one of the 
problems of "array computers": the exacting data layout 
and communication requirements that make such computers 
difficult to program. Furthermore, since the location of 
individual data elements is usually implicitly buried in 
the user's program, continuing operation with the loss of 
just one communication link or processing element becomes a 
difficult problem. An alternative approach using 
aggregates of data combined with dataflow allows the 
physical location of data to be divorced from the user's 
program as is shown in Section 3. The scheme proposed 
there allows the machine to function as long as at least 
one PE remains operational and sufficient memory is 
available (although time complexity may, of course, 


suffer). 
2.3.2 NxN Matrix Multiply - 

In this subsection, it is shown that two NxN matrices, 
A and B, can be multiplied in O(N log N) time ($N = 2^n$ without 
loss of generality) using a 2n-cube ($N^2$ PEs) when the 
location of the input is favorably distributed. The 
average time required over all possible input distributions 
is unknown at this time but is conjectured to be O(N log N) 
assuming no PE begins or ends with more than a constant 
number of input or result rows. 


2.3.2.1 One Possible Input Row Distribution - 

Let $p_{2n-1} \cdots p_n\, p_{n-1} \cdots p_0$ denote a PE address in the 
2n-cube. The N rows of matrix A are distributed over N 
distinct PEs such that the addresses of those PEs satisfy 
$p_{2n-1} \cdots p_n = p_{n-1} \cdots p_0$ (Figure 2.3a). Each of these PEs is 
an element of exactly one of the n-cubes (the front and 
back faces in Figure 2.3a) specified by the 
$N_{2n,n}$-partition, $\{i \mid 0 \le i < n\}$. The N rows of matrix B are 
evenly distributed over the PEs of the n-cube defined by 
$x_{2n-1} \cdots x_n\, 0 \cdots 0$ where x indicates a bit position which 
varies among PEs in the same n-cube (Figure 2.3c). This 
n-cube is one of the n-cubes specified by the partition 
$\{i \mid n \le i < 2n\}$ which is, of course, the complement of the 
$N_{2n,n}$-partition above. By Corollary 2.1a, the n-cube in 
which B is distributed shares exactly one PE with each 
n-cube containing exactly one row of A; this is the 
"favorable" input distribution requirement. 


2.3.2.2 The Matrix Multiplication Algorithm - 


1. Transpose B to form $B^t$ over n-cube $x_{2n-1} \cdots x_n\, 0 \cdots 0$ 
   using an N(N-1) transfer in N log N steps (Figure 
   2.3c). 


2. N-way broadcast each row of $B^t$ residing in PE 
   $p_{2n-1} \cdots p_n\, 0 \cdots 0$ to all PEs in the n-cube 
   $p_{2n-1} \cdots p_n\, x_{n-1} \cdots x_0$ in N log N steps (Figure 
   2.3d). This can be done for all rows at the same 
   time, since the n-cubes are distinct. 


3. N-way broadcast each row of A residing in PE 
   $p_{2n-1} \cdots p_n\, p_{n-1} \cdots p_0$ to all PEs in the n-cube 
   $x_{2n-1} \cdots x_n\, p_{n-1} \cdots p_0$ in N log N steps (Figure 
   2.3b). This step can also be done for all rows at 
   the same time. 


4. Each PE now contains a row of A and a column of B 
   and can form the inner product in O(N) steps 
   (Figure 2.3e). 


5. Using a "reverse" N-1 linear transfer, the N 
   elements of each result row can be brought 
   together within the same PEs which initially held 
   a row of A in O(N) steps. This step can also be 
   done for all rows at the same time. 


The total is (3N log N) + O(N) = O(N log N). 
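The data placement behind the five steps can be illustrated without modeling the network itself. The sketch below makes several assumptions for illustration: a PE address in the 2n-cube is written as a pair (h, l) of its high and low n bits, matrices are plain lists of rows, and the communication steps are collapsed into direct assignments, so only the final placement and the arithmetic are shown, not the timing. The function name is hypothetical.

```python
def ncube_matmul(A, B):
    """Multiply square matrices A and B following the placement of steps 1-5."""
    N = len(A)
    # Step 1: transpose B, so PE (h, 0) holds row h of B^t (column h of B).
    Bt = [[B[r][c] for r in range(N)] for c in range(N)]
    # Steps 2-3: after both broadcasts, PE (h, l) holds row l of A
    # (originally at PE (l, l)) and column h of B (originally at PE (h, 0)).
    pe = {(h, l): (A[l], Bt[h]) for h in range(N) for l in range(N)}
    # Step 4: every PE forms one inner product.
    partial = {addr: sum(a * b for a, b in zip(row, col))
               for addr, (row, col) in pe.items()}
    # Step 5: result row l is gathered from PEs (x, l) back into PE (l, l).
    return [[partial[(h, l)] for h in range(N)] for l in range(N)]
```

Each of the $N^2$ PEs ends up computing exactly one element of the product, which is why the inner-product step costs only O(N) time.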


2.3.3 Quicksort of N Distinct Elements - 

The analysis of quicksort presented here is simplified 
because there is no need to repeat the work of others. It 
is well known that the average time complexity of quicksort 
on a single processor is O(N log N) while the worst case 
complexity is $O(N^2)$ [24]. Let the N ($=2^n$ without loss of 
generality) distinct numbers to be sorted reside in an 
arbitrary PE of an n-cube. Assume this vector was 
transmitted to that PE on a hypothetical link $\mathrm{cube}_{-1}$. The 
quicksort algorithm is: 


For each PE receiving a vector to be sorted on link 
$\mathrm{cube}_j$ do 

    1. Let A be the input vector received by a 
       particular PE on link $\mathrm{cube}_j$; 

    2. For i from j+1 to n-1 do 

       a. Select the median of A which can be done 
          in O(length of A) [1, p.97]; 

       b. Construct (within the same PE) a new 
          vector A' using elements from A which are 
          less than or equal to the median; transmit 
          all other elements of A using link $\mathrm{cube}_i$ 
          as a vector to be independently sorted; 

       c. Let A=A'; 

    3. For i from n-1 downto j+1 do 
          Concatenate A with the sorted vector 
          received on link $\mathrm{cube}_i$ to form a new A; 

    4. Transmit the resulting sorted vector A on 
       link $\mathrm{cube}_j$. 


To see how this works, imagine the n-cube is split into two 
disjoint (n-1)-cubes. The source PE splits the input 
vector into two equal parts and transmits one of the parts 
to the other (n-1)-cube where it is independently sorted. 
Each N/2 element vector is then split again and one-half is 
sent to an (n-2)-cube to be independently sorted, and so 
on, until the length of each vector is one. No PE (or 
link) does more than O(N) work in this splitting phase. 
The vectors are then concatenated by reversing the above 
process starting with vectors of length one and ending with 
a vector of length N. Again no PE does more than O(N) 
work. Therefore, n-cube quicksort requires O(N) time. 
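The control structure of the n-cube quicksort can be sketched as a recursive function in which each recursive call stands for the PE at the far end of a link. The sketch rests on assumptions that are not in the text: the input holds $2^n$ distinct numbers, transmission is modeled as an ordinary function call (so the concurrency is not reproduced), and `statistics.median_low` is used for brevity even though it sorts internally rather than using the linear-time selection cited in step 2a.

```python
from statistics import median_low


def ncube_quicksort(vec, j=-1, n=None):
    """Sort `vec` as the PE that received it on link cube_j would."""
    if n is None:
        n = (len(vec) - 1).bit_length()   # assumes len(vec) == 2**n
    A = list(vec)
    returned = []                         # sorted vectors coming back, by link
    # Step 2: split on the median and ship the upper half over cube_i.
    for i in range(j + 1, n):
        m = median_low(A)
        low = [v for v in A if v <= m]
        high = [v for v in A if v > m]
        returned.append(ncube_quicksort(high, i, n))
        A = low
    # Step 3: concatenate in reverse link order, from cube_{n-1} down to cube_{j+1}.
    for _ in range(n - 1, j, -1):
        A = A + returned.pop()
    # Step 4: hand the sorted vector back on cube_j.
    return A
```

For instance, `ncube_quicksort([5, 3, 7, 1, 8, 2, 6, 4])` returns the eight values in increasing order. The source PE handles vectors of length N, N/2, N/4, and so on, which is the geometric series behind the O(N) bound.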


2.3.4 Binary n-cubes and Mesh Connected Computers - 

Before analyzing a portion of a partial differential 
equation problem, we will briefly explore the relationship 
between mesh connected computers (MCC) and n-cubes. A MCC 
is an interconnection of $N = 2^n$ identical PEs. The PEs are 
arranged in a q-dimensional $s_{q-1} \times \cdots \times s_0$ array, where each 
$s_i$ is a power of two and $s_{q-1} \cdot \ldots \cdot s_0 = N$. A PE address is 
expressed in standard coordinate indices as $i_{q-1}, \ldots, i_0$ 
where $0 \le i_k < s_k$, $0 \le k < q$. Each PE($i_{q-1}, \ldots, i_k, \ldots, i_0$) is 
connected to its nearest two neighbors in each of q 
dimensions PE($i_{q-1}, \ldots, i_k \pm 1, \ldots, i_0$), $0 \le k < q$, provided 
they exist. PEs at the boundaries of the mesh have fewer 
than 2q connections unless the MCC is specified to have 
"wraparound connections". Each PE in a MCC with orthogonal 
wraparound (OW) is connected to exactly 2q neighboring 
PE($i_{q-1}, \ldots, ((i_k \pm 1) \bmod s_k), \ldots, i_0$), $0 \le k < q$. These 
definitions are due to Nassimi and Sahni who consider 
optimal routing on MCCs [38]; fast sorting algorithms on 
MCCs are discussed in [29,39]. Illiac IV is a 
two-dimensional MCC with slightly different wraparound 
connections called "propagating wraparound" [38]. 

An interconnection network can be represented by a 
directed graph denoted by {V, E} where V is a set of 
vertices (PEs) and E is a relation which is a subset of VxV 
representing edges (connections) between vertices. Since 
only bi-directional links are considered here, E will 
automatically be a symmetric relation, i.e., if $(v_1, v_2) \in E$ 
then $(v_2, v_1) \in E$. A network $\{V_1, E_1\}$ is a subnet of 
network $\{V_1, E_2\}$ if $E_1 \subseteq E_2$ (note the set of vertices is 
the same). Network $\{V_1, E_1\}$ is topologically equivalent to 
network $\{V_2, E_2\}$ if there exists at least one function f 
called a re-labeling function satisfying: 
a) f is one-to-one and onto from domain $V_1$ to range $V_2$; 
b) $\forall (a, b) \in E_1$, $(f(a), f(b)) \in E_2$; 
c) $\forall (x, y) \in E_2$, $(f^{-1}(x), f^{-1}(y)) \in E_1$. 


Theorem 2.2. An OW MCC of size N is a subnet of an n-cube 
of size N. 


Proof: Consider the concatenation of the binary 
representations of the MCC coordinates $i_{q-1} \cdots i_0$ to be a 
binary PE address. Let $G_j(x)$ represent a j-bit Gray code 
mapping the integers x, $0 \le x < 2^j$, into the corresponding Gray 
code value. The re-labeling function f maps from MCC 
address $i_{q-1}, \ldots, i_0$ to n-cube address 
$G_{\log s_{q-1}}(i_{q-1}) \cdots G_{\log s_0}(i_0)$. f is clearly one-to-one and onto by a simple 
combinatoric argument. Next consider an arbitrary element 
of the MCC interconnection relation 
$((i_{q-1}, \ldots, i_k, \ldots, i_0), (i_{q-1}, \ldots, i_k + 1, \ldots, i_0))$ which is mapped by f to 

$$\big( (G_{\log s_{q-1}}(i_{q-1}), \ldots, G_{\log s_k}(i_k), \ldots, G_{\log s_0}(i_0)),\; (G_{\log s_{q-1}}(i_{q-1}), \ldots, G_{\log s_k}(i_k + 1), \ldots, G_{\log s_0}(i_0)) \big).$$

This link is an element of the n-cube interconnection 
relation since, by definition of a Gray code, consecutive 
Gray code values vary in exactly one bit position which 
conforms to the definition of the n-cube interconnection 
functions. Thus under the re-labeling function f, an MCC 
is a subnet of an n-cube. [] 


Pease proved a similar result for a more general network, 
the "indirect binary n-cube" [31]. 
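The re-labeling function of Theorem 2.2 can be tried out on a small mesh. The sketch below assumes the standard reflected Gray code, lists coordinates most significant dimension first, and requires every dimension size to be a power of two greater than one; the helper names are illustrative and not taken from the text.

```python
from itertools import product


def gray(x: int) -> int:
    """Standard reflected Gray code of x."""
    return x ^ (x >> 1)


def mcc_to_cube(coords, sizes) -> int:
    """Concatenate per-dimension Gray codes into a single n-cube address."""
    addr = 0
    for i, s in zip(coords, sizes):
        bits = s.bit_length() - 1          # log2(s) for a power of two
        addr = (addr << bits) | gray(i)
    return addr


def is_subnet(sizes) -> bool:
    """Check that every OW MCC link maps onto an n-cube link."""
    seen = set()
    for coords in product(*(range(s) for s in sizes)):
        a = mcc_to_cube(coords, sizes)
        seen.add(a)
        for k, s in enumerate(sizes):
            nb = list(coords)
            nb[k] = (coords[k] + 1) % s    # neighbor with orthogonal wraparound
            b = mcc_to_cube(nb, sizes)
            if bin(a ^ b).count("1") != 1:
                return False
    total = 1
    for s in sizes:
        total *= s
    return seen == set(range(total))       # f is one-to-one and onto
```

For a 4 x 8 mesh, `is_subnet((4, 8))` confirms that all mesh links, wraparound links included, join addresses at Hamming distance one in the 5-cube.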


Corollary 2.2a. An OW MCC of size N such that $s_k = 4$, 
$0 \le k < q$, is topologically equivalent to an n-cube of size N 
(e.g., Figure 2.4) (see footnote 7). 


Proof: The number of OW MCC connections per PE is 2q and 
the number of n-cube connections per PE (log N) is the same 
when $s_k = 4$, $0 \le k < q$, since then $N = 4^q$ and log N = 2q. 
By Theorem 2.2 (the OW MCC is a subnet of the n-cube) the 
two networks must be topologically equivalent. [] 


A re-labeling function is said to configure one network 
into another network. When the domain of the re-labeling 
function is physically part of a larger or more connected 
network, PEs and connections not included in the domain 
network may be ignored rather than physically deleted. A 
single network may thereby be configured into a number of 
other networks by distinct re-labeling functions acting on 
possibly different subnets of the original network. 


Footnote 7. When all $s_k = 2$, the OW MCC and n-cube are also equivalent 
since an n-cube is defined as a (q=n)-dimensional cube 
of side two. 


Corollary 2.2b. Let an n-cube be configured as a 
$s_{q-1} \times \cdots \times s_0$ MCC. Then this n-cube can be split into two 
disjoint (n-1)-cubes by a hyperplane bisecting the MCC in 
any one of the MCC's dimensions of size $s_k$ provided $s_k \ge 2$ 
(e.g., see Figure 2.5). 


Proof: Re-label the PEs from MCC addresses to n-cube 
addresses as in Theorem 2.2 but with special consideration 
given to the Gray code mapping coordinate index $i_k$ where k 
is the index of the bisected dimension. Let LMB represent 
the left most bit position in the binary representation of 
$i_k$. Select the Gray code such that $\mathrm{bit}_{LMB}(i_k) = 0$ for 
$0 \le i_k < s_k/2$, and $\mathrm{bit}_{LMB}(i_k) = 1$ for $s_k/2 \le i_k < s_k$ (e.g., a 
standard reflected Gray code). Then the $2_{n,n-1}$-partition = 
$\{j \mid 0 \le j < n\}$ - {the position of LMB in an overall n-cube 
address} specifies the desired partitioning of the n-cube 
into two disjoint (n-1)-cubes. [] 


Corollary 2.2b may be applied recursively to partition 
an n-cube configured as a MCC into many different m-cubes 
of various sizes (each a power of two). These m-cubes may 
be MCC hypersolids or hyperplanes ranging in size from half 
of the MCC to parts of individual rows of the MCC. In the 
sequel, MCC hyperplanes will be denoted by listing the 


fixed coordinates within parentheses, e.g., ($i_1=5$, $i_0=0$). 


2.3.5 Partial Differential Equation Complexity - 

The "nearest neighbor" connections of MCCs fulfill a 
major part of the communication requirements for iterative 
solutions to partial differential equations (PDE). 
Although convergence, stability, etc. complicate the issue 
[17], for simplicity the analysis here considers only 
nearest neighbor communication requirements. For this 
purpose, an n-cube of appropriate size is configured as an 
OW MCC in accordance with Theorem 2.2. Unlike most studies 
of PDE solutions which assign one data element to each PE 
of a MCC, here a row of data elements is assigned to each 
PE; again the motivation is to meet the requirements of 
the implementation to be presented in Section 3. 

Consider a q-dimensional $s_{q-1} \times \cdots \times s_0$ PDE problem where
each $s_i$ includes boundary data at indices 0 and $s_i - 1$. This
problem may be mapped onto a q-dimensional MCC of size
$s_{q-1} \times \cdots \times s_1 \times O(s_0)$ such that each data row
$(i_{q-1}, \ldots, i_1)$ of size $s_0$ is placed in PE$(i_{q-1}, \ldots, i_1)$
in MCC hyperplane ($i_0 = 0$), e.g. see Figure 2.6. The computation then
progresses along the dimension of size $O(s_0)$ such that MCC
hyperplane ($i_0 = k$) contains the state of the problem at
iteration k. An informal description of the data movement
is as follows, where z is the actual value of $O(s_0)$ and a
PE is called active when the necessary data are available:


For k from 1 until (convergence is achieved) do
  For j from 1 to $s_0 - 2$ do
    Each active PE sends the jth element of its row to
    each of its nearest 2(q-1) neighbors in MCC
    hyperplane ($i_0 = k$) provided they exist. When the
    necessary data is present, each active PE computes
    an element for the next k-loop iteration, and
    passes this result along the dimension of size z to
    its neighbor in MCC hyperplane ($i_0 = (k+1) \bmod z$),
    i.e., orthogonal wraparound connections are used
    when k=z-1;


For each element produced in a k-loop iteration, an
interior PE will thus send 2(q-1) elements to neighbors,
receive 2(q-1) elements, and send one result element to the
next hyperplane. The PE computation for each such result
is assumed to take at most O(q) time and thus the time
required to produce the first elements for iteration k+1 is
O(q). The PEs in hyperplane ($i_0 = k+1$) which receive this
data may then begin to exchange data and compute results as
soon as the first few elements of each input data row
arrive. The final computation for each row of size $s_0$ may
thus be thought of as being carried out by an $O(s_0)$-stage
circular "pipeline". The time required is $O(q s_0)$ to
initially distribute computation in the pipeline. All
stages may then compute concurrently to finish in O(qT)
time where T is the number of k-loop iterations required
for convergence. The total is $O(q(s_0 + T))$ for the overall
PDE computation.⁸

⁸ The time required for the PDE problem using a comparable
number of PEs where each PE is assigned only a constant
number of data elements (as in most array computer
algorithms) is O(qT).

3.0 ONE IMPLEMENTATION OF THE PARALLEL COMPUTER MODEL


In the previous section, several complexity analyses
were derived based on a simple parallel computer model.
This simple model is not proposed as an implementation
since machines based on centralized control often lack the
flexibility, ease of programming, and extensionality
desired for general-purpose computation.

The present section shows that the control of a
machine based on the n-cube model can be decentralized with
minimal effect, in the best case, on time complexities
derived in Section 2. Of course, flexibility and ease of
programming are quite subjective and no proofs can be
presented for the claimed improvements. Instead, the
following characteristics of the proposed machine are cited
to support the claim:

1. The machine is to be programmed in the high level 
language, Id (Irvine dataflow) [3], instead of the 
assembly-like languages usually required for 
effective use of other multiprocessor computers. 
Id provides for transparent expression of 
parallelism (i.e., parallel operation is the 
default mode rather than the exception); Id is 
also side-effect free (functional) and shares many 
of the advantages of other applicative languages
such as FFP [6], pure LISP [27], and LUCID [5];

2. Automatic memory management is provided along with
a structured data type;

3. Id programs are independent of the number of
processors or their interconnection.


Decentralized control has been demonstrated in a
number of dataflow systems [12,14,15,22,42]. However,
analyses of these systems have not yet produced time
complexity results as good as those derived in Section 2.
The sequel describes how the n-cube model and a dataflow
system [22] can be combined to obtain the benefits outlined
above. Dataflow is asynchronous by definition and thus
each PE in the proposed machine will communicate
asynchronously without centralized clock or control. Since
the analyses in Section 2 depended on a central clock, the
results of those analyses represent best cases for the
asynchronous system. Hence, the complexity results derived
in the sequel are not intended to prove that an actual dataflow
machine would attain these best case results, because
providing adequate scheduling may be difficult; rather, the
purpose of the analysis is to suggest that time complexity
analysis is indeed a useful design and evaluation tool,
since by its use major bottlenecks in previous iterations
of the architecture have been systematically identified and
eliminated.


3.1 Overview of the Machine's Operation 


As mentioned above, the dataflow machine is an
asynchronously interconnected n-cube of N PEs. Although
the distributed PE memory is organized as one address space,
each PE is solely responsible for managing its own random
access memory. Program code is compiled into a data
structure, so the following applies to both object code and
program data. The location of each data structure is
specified by a unique identifier (pointer) which may be
passed anywhere in the machine. When actual data is
required, the requesting PE forwards a message to the PE
where the data is located; the receiving PE services the
request by sending back the requested data.

The following is a brief summary of the execution of 
Id programs; details may be found in [3, 22]. A compiled 
Id program is a directed graph where each node represents 
an operation and each link indicates that the result of one 
operation becomes the input to another. An operation can 
be any (side-effect free) function which consumes one set 
of inputs and produces one set of outputs. An execution 
instance of an operation is called an activity and each 
activity is given a unique activity name. Each value 
resulting from an activity's execution is concatenated with 
the value's destination activity name into a packet called 
a token. Destination activity names are computed from the 
activity names of input tokens according to a set of rules 
located in each PE called the U-interpreter. 

All input tokens to an activity must be directed to 
the same PE even though those input tokens may have been 
produced by many distinct PEs. The U-interpreter ensures
that all tokens destined for the same activity have
identical activity names. An assignment function is used
by each PE to map the activity name of a result token to a
physical PE address.⁹ This address is then used to direct
the token through the interconnection network to the PE
holding the destination activity. Different assignment
functions may be used concurrently in the machine so long 
as all PEs which are to send tokens to the same activity 
use the same assignment function. 
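
As an illustration only, the matching of tokens to activities and the role of the assignment function might be sketched as follows. The tuple form of the activity name, the port field carried by each token, the fixed number of inputs per activity, and the CRC-based assign function are all assumptions made for this sketch; the assignment functions actually studied are those of [22].

```python
import zlib
from collections import defaultdict


def assign(activity_name, num_pes):
    """Hypothetical assignment function: activity name -> PE address."""
    return zlib.crc32(repr(activity_name).encode()) % num_pes


class PE:
    """A processing element that collects input tokens per activity."""

    def __init__(self, address, arity):
        self.address = address
        self.arity = arity                  # inputs needed before execution (assumed fixed)
        self.waiting = defaultdict(dict)    # activity name -> {port: value}

    def receive(self, token):
        """Store a token; return (name, inputs) once the activity is enabled."""
        name, port, value = token
        slots = self.waiting[name]
        slots[port] = value
        if len(slots) < self.arity:
            return None                     # still waiting for more input tokens
        del self.waiting[name]
        return name, [slots[p] for p in sorted(slots)]


# Two producers compute the same destination address independently.
name = ("context0", "multiply", 7, 3)
pe = PE(assign(name, 8), arity=2)
first = pe.receive((name, 0, 3))
second = pe.receive((name, 1, 4))
```

Because every sender applies the same function to the same activity name, both tokens reach the same PE without any central coordination, which is the property the text requires of an assignment function.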

A PE may contain many activities and each activity may 
be in any one of several stages of completion. The first 
stage commences when the activity's first input token 
arrives and lasts until sufficient input tokens are present 
to enable execution to begin. The activity then progresses 
through a series of stages which include operation code 
fetch, data fetch (if needed), operation execution, and 
output token generation and transmission. Activities are 
"multiprogrammed" within a single PE so that temporarily 
blocked activities (e.g., awaiting data fetch) need not 
monopolize execution resources. This capability allows 
flexibility in assigning processor resources since a single
PE is sufficient to execute an entire program (assuming
sufficient memory is available).

⁹ A detailed discussion of several specific assignment
functions and their effects may be found in [22].

3.2 Structures and Data Communication


The ability to specify computation on data types such 
as arrays, lists, records, etc. is often crucial for 
convenient expression of algorithms. In Id, such data 
types may be represented by structures. A structure is a 
set of (selector, value) pairs where a selector is an 
integer, and a value is any value, including another 
structure [13]. < > denotes the empty structure while
<i:x, j:v, ...> represents a structure with value x at
selector i, value v at selector j, etc. Two functions are
defined on structures. The select function (denoted by
x[i]) has two arguments, a structure x and a selector i,
and yields the value at selector i. The append function
has three arguments: a structure, a selector, and a value
to be appended to the given structure at the specified
selector. Append does not modify the given structure but
instead makes a copy of it with the new selector and value
placed appropriately. Various implementations of
structures are discussed in [21]. For simplicity, only one
of these implementations, the vector representation¹⁰, is
discussed in the sequel although generalization to the
B-tree representation [21] has many advantages, e.g. the
representation of sparse arrays.

¹⁰ In [21] this implementation was called "array"
representation.

In this paper, the following implementation of
structures is assumed. Each structure is associated with a
unique address (pointer). In vector representation, a 
contiguous vector of memory cells is allocated to contain 
the elementary values or pointers to substructures which 
collectively comprise the elements of the structure. 
Select is implemented by indexing in the usual way. 
However, append in general requires a copy be made of the 
entire vector and the new value placed appropriately. 
Substructures need not be copied as is shown in [13,21]. 
Copying of the original vector can also be avoided when 
only one pointer exists which refers to that vector. In 
this case, append may safely update the vector in place. 
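
The vector representation just described can be sketched as below. This is a minimal model under stated assumptions: the class name VectorStructure and the sole_reference flag are hypothetical, and the flag stands in for whatever mechanism tells the machine that only one pointer to the vector exists.

```python
class VectorStructure:
    """A contiguous vector of cells holding values or pointers to substructures."""

    def __init__(self, size, cells=None):
        self.cells = list(cells) if cells is not None else [None] * size


def select(x, i):
    """Select by ordinary indexing."""
    return x.cells[i]


def append(x, i, v, sole_reference=False):
    """Return a structure equal to x except for value v at selector i.

    In general the top-level vector is copied; substructures are shared,
    not copied. If the caller holds the only pointer to x, the update is
    done in place.
    """
    target = x if sole_reference else VectorStructure(len(x.cells), x.cells)
    target.cells[i] = v
    return target


row = VectorStructure(2, [1.0, 2.0])
m0 = VectorStructure(2)
m1 = append(m0, 0, row)                       # copies m0, shares row
m2 = append(m1, 1, row, sole_reference=True)  # updates m1 in place
```

The copying case costs time proportional to the vector length, while the in-place case is constant time, which is why the later analysis relies on only one pointer existing to a structure under construction.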

Figure 3.1 shows a structure representing an NxN matrix
where the (selector, value) set nearest the root is called
the top level, and the collected substructure (selector,
value) sets is called the bottom level. Level names can
also be generalized to q-level structures where the top
level is the first level, the next level from the root is
the second level, and so on until the qth level. Hence a
level name is derived from the path length from the root to
the named level.

For the moment, the important problem of storage
reclamation of structures will be ignored. We assume all
object code and data aggregates are represented by
structures, and furthermore that the programmer specifies
(perhaps by type declaration) the largest number of
elements each structure may hold so that sufficient space
is initially allocated to contain it. The n-cube message
transfers of Section 2 can then be implemented using
structure operations as will be shown below.


A transfer begins with a source message configuration
and terminates with a destination message configuration.
The inverse of a transfer begins with the original
destination message configuration, reverses each step of
the original transfer, and terminates with the original
source message configuration. Clearly the inverse transfer
requires the same time as the original transfer. In a
transfer based on structures, the values to be transmitted
are grouped into a structure and a pointer to that
structure is distributed to the PEs which are to receive
one or more of the values. Each PE then sends a request
(for each value it requires) to the appropriate PE which is
holding that part of the structure containing the required
value. PEs receiving such requests service them by
selecting and replying with the value requested. Each PE
acts independently, but the collective effect is called a
request/acknowledge transfer. Each of the n-cube transfers
can be implemented by the request/acknowledge mechanism
without changing the best case order of time complexity.


Consider first the N-way broadcast and an additional
implementation mechanism called a cache tree.¹¹ A cache
tree is a distributed cache which automatically configures
itself into the logical tree appropriate for each data item
broadcast. The PE holding the value to be transmitted is
the root of the broadcast tree; the tree also includes
each requesting PE as well as all PEs in the paths from the
requesting PEs to the root PE. The cache tree may be
implemented by an associative memory table in each PE. An
entry in the table consists of a two part key (a structure
pointer s and a selector i) and a data field containing the
value $s_i$ if it is available; otherwise, the data field
contains a pointer to a locally held list of requests
received from other PEs for that same value $s_i$. When a PE
receives a request, it looks up keys s and i in the cache.
If the value is found, the PE replies as if it were the
root PE with the value $s_i$; otherwise the request is added
to the list of requests for that value. If the list was
previously empty, the request is also forwarded toward the
root PE. When the response value $s_i$ is received, the PE
enters the value in its cache and sends a copy of the value
for each of the requests in the list associated with keys s
and i. An instance of a broadcast tree is thus dynamically
constructed as requests filter toward the root PE and no PE
need receive more than log N requests.¹² After the tree is
constructed the lists of requests in the caches are used to
direct the actual broadcast of the data item. In the
n-cube network, any PE can be the root for a broadcast and
many such broadcasts may be progressing simultaneously.¹³
Clearly the best case order of time complexity for the
cache tree N-way broadcast does not change over the
broadcast of Section 2 since communication time is at most
multiplied by a constant.

¹¹ The cache tree was independently developed by Sullivan,
Bashkow, Klappholz, and Cohn who called it a "conflict
filter" [38]. However, Dr. Bashkow has indicated (by
personal communication) that the cache tree is not included
in current designs for the CHoPP machine, perhaps because
of the difficulty of maintaining consistency in multiple
copies of data. This problem does not arise in a
functional environment such as dataflow because values are
never modified; in practice this means cache values are
read-only although redundant values other than the original
source may be deleted, for example by a least recently used
policy, without affecting correct operation.
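
The cache tree protocol can be simulated in a few lines. The sketch below is not the machine's implementation: it handles a single (s, i) key, processes messages from one FIFO queue rather than asynchronously, and assumes a fixed routing rule (each PE forwards toward the root by flipping the lowest address bit in which it differs from the root). Under that assumed rule each PE has at most log N possible children, so the bound on received requests holds here without the scheduling care the text calls for. All names are hypothetical.

```python
from collections import deque

LOCAL = -1  # marks a request issued by the PE itself


def cache_tree_broadcast(n, root, value, requesters):
    """Simulate one cache tree broadcast on an n-cube; value must not be None."""
    size = 1 << n
    cache = [None] * size                # cached value per PE, once known
    pending = [[] for _ in range(size)]  # deferred requesters per PE
    received = [0] * size                # requests received from other PEs
    delivered = {}
    cache[root] = value
    queue = deque(("req", pe, LOCAL) for pe in requesters)

    def toward_root(pe):
        diff = pe ^ root
        return pe ^ (diff & -diff)       # flip the lowest differing bit

    def reply(pe, src):
        if src == LOCAL:
            delivered[pe] = cache[pe]
        else:
            queue.append(("rep", src, pe))

    while queue:
        kind, pe, src = queue.popleft()
        if kind == "req":
            if src != LOCAL:
                received[pe] += 1
            if cache[pe] is not None:
                reply(pe, src)           # answer as if this PE were the root
            else:
                first = not pending[pe]
                pending[pe].append(src)
                if first:                # forward only the first request
                    queue.append(("req", toward_root(pe), pe))
        else:                            # the value arrives from the parent
            cache[pe] = value
            for waiter in pending[pe]:
                reply(pe, waiter)
            pending[pe].clear()
    return delivered, received


delivered, received = cache_tree_broadcast(4, 5, "v", range(16))
```

With all 16 PEs of a 4-cube requesting, every PE obtains the value and no PE, the root included, receives more than 4 requests.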

¹² The order of request transmission must be carefully
scheduled in the inverse broadcast phase to ensure no more
than log N requests are actually received by any one PE.
Achieving such optimum scheduling is difficult; however,
scheduling policies approximating the desired behavior may
prove to be adequate.

¹³ Simultaneous transfers can be "timesliced" so that the
total time required is the sum of the individual transfer
complexities.

Next consider the N(N-1) transfer. (Discussion of the
N-1 linear transfer will be omitted since a similar
argument applies to it as well.) Let the N values to be
transferred from each PE constitute one row of an NxN matrix
which is represented by a structure as in Figure 3.1 where
the top level may be located in an arbitrary PE of an
n-cube. In the N(N-1) request/acknowledge transfer each
destination PE requires a pointer to every row of the
matrix. Hence in the first phase of the transfer, all PEs
request via a cache tree a pointer to the first row,
requiring a best case time of O(log N), then the second
row, and so on until the Nth row for a total best case time
of O(N log N). In the second phase, each PE sends N
requests for the N values it is to receive. These requests
collectively form an (inverse) N(N-1) transfer. In the
third phase each PE holding a row of values to be
transmitted selects and sends the requested values,
collectively forming another N(N-1) transfer. Each PE
receives N values and appends them into a vector; each
such append requires only constant time because only one
pointer to the structure being formed need exist. Pointers
to each result row may be collected by an inverse N-1
transfer and appended together in O(N) time to produce a
new NxN matrix represented by a structure. The total best
case time required for the request/acknowledge N(N-1)
transfer remains O(N log N) since each step requires no
more than O(N log N) time.
3.3 Storage Reclamation


The above mechanisms are adequate to implement the
transfer algorithms of Section 2 without affecting the
order of time complexity if storage reclamation is ignored.
Of course storage reclamation must be dealt with in a
practical machine. The reference count scheme [18] is
often proposed [13,21] for storage reclamation in dataflow
because structure definitions preclude circular references.
In the reference count scheme, each structure has an
associated non-negative integer called the reference count
indicating the number of copies of the pointer referring to
that structure. The reference count is incremented and
decremented as copies of the pointer are respectively
created and consumed. When the count is zero, the
structure is no longer needed and may be reclaimed.
However, in a distributed processor environment the
classical reference count scheme incurs substantial
communication overhead when copies of pointers are made
since a request/acknowledge communication is required to
update the reference count before the new pointer may be
released.¹⁴ This communication overhead is unacceptable as
can be shown by considering the cache tree above. Suppose
the value to be broadcast is a pointer to a structure.
Then each time an internal node in the cache tree
replicates this pointer, it must first send a request to
the PE holding the structure to increase the reference
count. Since N-1 copies of the pointer will be made in an
N-way broadcast, the PE holding the structure requires O(N)
time to process the reference count; thus reference count
processing can increase broadcast time from O(log N) to
O(N).

A generalization of classical reference counting 
called weighted reference counting may be used to reduce 
such overhead [2]. In this scheme, an arbitrary positive 
integer called the pointer reference weight (PRW) is 
associated with each instance of a pointer. Corresponding 
to the reference count in the classical scheme, each 
structure has an associated non-negative integer called the 
structure reference weight (SRW) which is the sum of the 
PRWs of all pointers referring to that structure. (In the 
classical scheme, all implied pointer reference weights are 
equal to one and thus the SRW is the same as the reference 
count.) As in reference counting, when a pointer is
destroyed its PRW must be subtracted from the referenced
structure's SRW. However, when m copies of a pointer with
a PRW equal to x are made, if $x \ge m$ then no change to the SRW
is required. In this case the pointer's PRW may be "split"
into arbitrary positive integers $x_1, \ldots, x_m$ such that
$x = x_1 + \cdots + x_m$, where $x_i$ becomes the PRW on the ith copy of the
pointer. (The original instance of the pointer is
destroyed.)

¹⁴ The PE making the copy must wait for acknowledgment that
the reference count was actually increased. Otherwise, the
asynchronous operation of the machine could allow a
reference count decrement to occur from the destruction of
an otherwise unrelated instance of the pointer; this could
lead to premature reclamation of the structure.

Since changes to the SRW may be avoided when copies of
a pointer are made, the reference weight scheme can
dramatically reduce time overhead. For example, reconsider
the problem of broadcasting a structure pointer to N PEs.
If the PRW of this pointer is at least N then the broadcast
can be done in O(log N) time because splitting the
pointer's PRW at each internal node of the cache tree
increases each PE's work by only a constant factor.
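
A small sketch may make the bookkeeping concrete. It models only the weights, not the communication; the class and function names are hypothetical, the even division used in split is one arbitrary choice among the splits the scheme permits, and the case of a PRW smaller than the number of copies is simply rejected here rather than handled by increasing the SRW.

```python
class RefStructure:
    """A structure together with its structure reference weight (SRW)."""

    def __init__(self, weight):
        self.srw = weight
        self.messages = 0       # SRW update messages received so far
        self.reclaimed = False

    def decrease(self, amount):
        self.messages += 1
        self.srw -= amount
        if self.srw == 0:
            self.reclaimed = True


class Pointer:
    """One instance of a pointer with its pointer reference weight (PRW)."""

    def __init__(self, target, prw):
        self.target, self.prw = target, prw


def new_structure(weight):
    s = RefStructure(weight)
    return s, Pointer(s, weight)


def split(p, m):
    """Replace p by m copies whose PRWs sum to p.prw; no message is sent."""
    if p.prw < m:
        raise ValueError("PRW too small to split without changing the SRW")
    base, extra = divmod(p.prw, m)
    copies = [Pointer(p.target, base + (1 if k < extra else 0)) for k in range(m)]
    p.prw = 0                   # the original instance is destroyed
    return copies


def destroy(p):
    """Destroying a pointer subtracts its PRW from the structure's SRW."""
    p.target.decrease(p.prw)
    p.prw = 0


# Broadcast down a binary tree of depth 4: 16 copies, no SRW traffic.
s, p = new_structure(16)
level = [p]
for _ in range(4):
    level = [c for ptr in level for c in split(ptr, 2)]
```

After the loop the structure has received no update messages, in contrast to the N-1 increments the classical scheme would require; messages arrive only when the copies are eventually destroyed.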


The cache tree can also reduce time overhead in a SRW
decrease operation. Suppose N PEs each request the value
$s_i$ from structure s. Each request contains a pointer
referring to s and since select destroys the pointer it
receives as an argument, the PRW of the pointer referring
to s in each of the N requests must be subtracted from the
SRW of s. This can be done in O(log N) time by having each
internal cache tree node accumulate the decrease in SRW as
it services incoming requests for $s_i$. However, when an
internal cache tree node has no a priori information about
the number of $s_i$ requests it will receive, it is not
convenient to accumulate all such decreases in SRW before
sending the first request for $s_i$ to the root PE. This
problem can be handled by delaying transmission of decrease
SRW messages until copies of $s_i$ have actually been
delivered to the PEs which originally requested $s_i$. These
PEs then initiate an additional inverse broadcast to sweep
together all decrease SRW messages into one decrease SRW
value which is forwarded to the PE holding structure s.
The total time required for the request/acknowledge data
broadcast and a broadcast to accumulate the SRW decrease
messages remains O(log N). Other policies with various
tradeoffs in timeliness of storage reclamation versus
concurrency potential are also possible using the reference
weight scheme.


3.4 NxN Matrix Multiply and N Element Quicksort Complexity 


The above mechanisms are adequate to implement the
matrix multiply and quicksort algorithms of Section 2
without affecting the best case order of time complexity.
(The PDE problem is considered in a later subsection.) An
Id program for matrix multiply is given in Figure 3.2 where
the more readable syntax "new x[i] <- v" represents "new x
<- append(x,i,v)" for appends within loops and
"x[i,j,...,m]" represents "select(... select(select(x,i), j) ..., m)"
within expressions. This program
differs from a conventional matrix multiply program in two
respects. First, the matrices are represented by
structures (Figure 3.1) and thus A[i] returns a pointer
referring to the ith row of matrix A. Second, since
structure representation biases element access (i.e., in
lexicographic order) the B matrix is first transposed
before the multiplication is performed so that rows of $B^t$
represent columns of B.

The Id program for matrix multiply (as well as all
other programs considered in this paper) is independent of
the size of the input. Thus the program can be distributed
using a cache tree to $N^2$ PEs in O(log N) time; hence all
PEs are assumed to hold a complete copy of the program
although this would not necessarily be the case in a real
machine. As was discussed earlier, dataflow code is a
collection of interconnected functions and copies of each
function may begin execution when the necessary arguments
arrive. However, for the purpose of complexity analysis it
is sometimes helpful to view the initiation of function
execution differently. A PE is said to initiate the
execution of a function if it supplies all arguments
required for that function. A part of each of the
following complexity analyses determines the time for a
single PE (the starting point for the computation) to
initiate the various parts of a program using the transfer
mechanisms of Section 2.

The following concerns the time complexity of matrix
transpose (Figure 3.2). The rows of each matrix are
assumed to be distributed as described in Section 2.3.2.1
while the top level of each structure may reside in any one
of the PEs also holding a row of the same structure. The
transpose procedure and its nested i-loop may be initiated
in any PE holding a row of matrix B. Under the
U-interpreter, the i-loop then initiates N copies of the
j-loop each in a distinct PE holding a row of B in O(N)
time using an N-1 linear transfer. The transpose can then
be completed in O(N log N) time simply by using a
request/acknowledge N(N-1) transfer.

In the multiply procedure, each i-loop initiates N
copies of the j-loop in N distinct PEs and each j-loop then
initiates N copies of the k-loop for a total of $N^2$ k-loop
initiations in $N^2$ PEs all in O(N) time. The time for
structure access is determined in the following. Since
row A is one of the inputs to a k-loop initiation,
sufficient copies of pointers referring to rows of matrix A
are made by the U-interpreter without increasing the order
of time complexity of program initiation (assuming the
reference weights of the original pointers are large
enough). Each k-loop initiation then uses a single copy of
one of these pointers N times to request the N elements of
a row of A. The case for matrix B is slightly different
since $B^t[j]$ occurs in the j-loop which generates a total of
$N^2$ requests for pointers. These requests may be satisfied
in $O(N \log N^2) = O(N \log N)$ time using a cache tree. Thus
all requests for pointers may be satisfied in O(N log N)
time. The actual data elements are then acquired by
transfers as described in Section 2 but using
request/acknowledge communication. Pointers referring to
rows of the result matrix can then be appended together
using an N-1 linear transfer in O(N) time. Thus matrix
multiply can be done using structures to represent
matrices, a high level Id program to represent the
algorithm, and the pointer reference weight scheme to
provide for storage reclamation without increasing the best
case order of time complexity over that derived in Section
2.

An Id program for quicksort is given in Figure 3.3.
Since the numbers to be sorted can be represented in a
one-level structure, no special consideration for structure
representation is needed. Thus the data transfer algorithm
given in Section 2.3.3 is directly applicable in this case
to give a best case time complexity of O(N).


3.5 I-structures 


If an iterative solution to the PDE problem is
programmed using a q-level structure, the pipelining and
hence the degree of parallelism described in Section 2 will
be lost. Consider the compilation¹⁵ of a simple Id loop
which builds a structure x (Figure 3.4). The $L$ and $L^{-1}$
boxes generate and strip away, respectively, context for
each loop initiation much like conventional systems
generate and strip away context for each initiation and
return of a procedure. Similarly, the $D$ and $D^{-1}$ boxes
generate and strip away a unique context for the values
within each iteration of the loop. Such context changes
are directed by the U-interpreter and need not be
considered further here. Boxes with internal values such
as < > or v produce that value when triggered by any input
token. The &) operator performs the identity function by
passing each input token directly to its output port. The
switch operator decides to which output port (T or F) each
input token is to be sent based on the corresponding
boolean valued token received at its side port. Forks in
lines indicate that the token input to the fork is to be
replicated so identical tokens are placed on each output
line.

¹⁵ Id compilation is discussed in detail in [3].

The point of interest in Figure 3.4 is that the output
of the append box, new x, is circulated on each iteration
of the loop and thus x does not appear on the loop return
line until the loop terminates. This ordered construction
of x is required by the semantics of structures as can be
seen in the following example. Replace the third line of
the program in Figure 3.4 with "new x <- append(x, f(i),
g(i))". Since f(i) need not be a one-to-one function, the
selectors used for the m appends need not be distinct.
Hence the final value at each selector of structure x is
not known until the loop terminates. The effect of these
build-before-use structures on the PDE problem is to delay
initiation of the computation for the next iteration of the
outer loop until the current iteration is complete, and
hence no pipelining between outer loop iterations is
possible.
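
The point about non-distinct selectors can be seen in a tiny model of the modified loop. The sketch uses a Python dictionary in place of a structure and a function name, build, that is purely illustrative; the particular f and g are arbitrary choices for the example.

```python
def build(m, f, g):
    """Model the loop 'new x <- append(x, f(i), g(i))' for i = 1..m."""
    x = {}                       # the empty structure < >
    for i in range(1, m + 1):
        x = {**x, f(i): g(i)}    # append yields a new structure each iteration
    return x


# f is not one-to-one: selectors 0 and 1 are each written several times.
final = build(5, lambda i: i % 2, lambda i: 10 * i)
```

Here selector 1 receives 10, then 30, then 50, and selector 0 receives 20, then 40, so a consumer that read x[1] before the last iteration would see a value that is not the final one. This is why an ordinary structure cannot be released until the loop terminates.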

One solution to this problem is to use I-structures 
instead of structures.16 I-structures may be regarded as 
structures constructed in a restricted way. In the sequel 
only one operational semantics and implementation of 
I-structures is considered. Arvind and Thomas present a 
more complete theory of I-structures and compare 
I-structures with other functional data types [4]. The 
restriction on I-structure construction used here is that 
the value at each selector of a particular structure may be 
appended to at most once (termed the single assignment rule 
for selectors).1? 

The single assignment rule suggests an I-structure 


implementation which allows values to be selected from an 


16 : "4 

Another solution is to use a data type called “stream" 
[3] instead of structures. However, streams are often 
inappropriate for expression of numerical algorithms [4]. 
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I-structure before that I-structure is complete. Figure 
3.5 shows the compilation of the same program fragment as 
in Figure 3.4 but where x is considered an I-structure 
instead of a structure. The I-structure pointer generator 
box allocates memory for the I-structure (the bounds of the 
I-structure must be supplied), initializes the value at 
each selector to the not-present or empty valuel8, and 
sends out two pointers referring to the I-structure. For 
error checking these pointers wee marked "read-only" or 
"updateable" since in this simple model only the inside of 
the loop is allowed to append to the I-structure. This 
allows the clean up box to convert the I-structure to a 
Structure when the loop terminates by changing all empty 
values remaining in the structure to the undefined value. 
Note that the output of pointer x does not depend on 
termination of the loop. ?9 Thus values at individual 
selectors of an I-structure may be selected from outside 


17 This rule differs from the single assignment rule for
program variables [9] since the validity of the rule for
program variables may be determined at compile time. The
validity of the single assignment rule for structure
selectors cannot in general be determined until execution
is complete as was shown in the example above where
function f(i) determined the selectors.


18 Alternatively, memory could be mass initialized before
allocation and then each "memory block" would be
reinitialized on deallocation.
the loop as each value is appended to the I-structure
within the loop.

Implementation of I-structures is similar to
structures except a presence bit is associated with each
I-structure selector. This presence bit is checked when a
select is attempted from an I-structure. If the presence
bit is on, select simply returns the value at that
selector. However if the presence bit is off, the value at
that selector is really a pointer to a list of select
requests for that value. Each such request is delayed by
adding the request to the list; the PE servicing that
request may then go on to other tasks. When the value
eventually becomes available through an append operation,
each request on the list for that selector is satisfied by
sending a copy of the selected value. Append also checks
the presence bit when appending to an I-structure. If the
presence bit is already on for the selector being appended
to then the single assignment rule for that selector has
been violated and an appropriate error message may be
issued.
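
The select and append behavior described above can be sketched in Python as follows. This is only an illustrative model, not the machine's implementation: the class name IStructure, the exception SingleAssignmentError, and the use of a callback (reply) to stand for "sending a copy of the selected value" are all assumptions made for the sketch. For readability the deferred-request list is kept in its own field rather than overlaid on the value slot as the text describes.

```python
class SingleAssignmentError(Exception):
    """Raised when a selector is appended to more than once (hypothetical name)."""


class IStructure:
    """Toy I-structure: one presence bit and one deferred-request list per selector."""

    def __init__(self, lower, upper):
        self.lower = lower
        size = upper - lower + 1
        self.present = [False] * size               # presence bits, initially off
        self.value = [None] * size                  # meaningful only when present
        self.deferred = [[] for _ in range(size)]   # pending select requests

    def select(self, selector, reply):
        """Deliver the value to `reply` now if present, otherwise queue the request."""
        i = selector - self.lower
        if self.present[i]:
            reply(self.value[i])
        else:
            # The request is delayed; the requester is free to do other work.
            self.deferred[i].append(reply)

    def append(self, selector, value):
        """Write a selector once and satisfy every request deferred on it."""
        i = selector - self.lower
        if self.present[i]:
            raise SingleAssignmentError(f"selector {selector} appended twice")
        self.present[i] = True
        self.value[i] = value
        waiting, self.deferred[i] = self.deferred[i], []
        for reply in waiting:
            reply(value)
```

In this sketch a select issued before the matching append simply waits in the list and is answered when append runs; a second append to the same selector raises the error.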


19 Since termination of I-structure programs does not depend
on termination of embedded loops, I-structure programs are
more defined in the sense that an I-structure program may
produce results when an otherwise equivalent structure
program does not.
3.6 Partial Differential Equation Complexity


I-structures allow the pipelining desired in the PDE
problem since I-structures need not be complete before
values are selected. The following considers the time
complexity of an I-structure program which solves for one
variable of a q-dimensional $s_{q-1} \times \cdots \times s_0$ PDE
(Figure 3.6). Assume the initial data is laid out in a
$s_{q-1} \times \cdots \times s_1 \times O(s_0)$ MCC as described in
Section 2 but where the data is represented as a q-level
structure $x_0$. Recall each $s_i$ includes boundary data at
indices $0$ and $s_i - 1$, while the coordinate indices
$i_{q-1}, \ldots, i_0$ indicate a PE address in a MCC, and a list
of fixed coordinates in parentheses represents a MCC
hyperplane. Assume the levels of structure $x_0$ are
distributed such that $x_0[j_{q-1}, \ldots, j_m]$, $0 < m \le q$, is
located in any PE in MCC hyperplane
$(i_{q-1} = j_{q-1}, \ldots, i_m = j_m, i_0 = 0)$, where $m = q$ means
the only restriction is $i_0 = 0$ (e.g., Figure 3.7). Thus the
top level may reside in any PE in hyperplane $(i_0 = 0)$
while lower levels are restricted to the MCC hyperplanes
from which most select requests for each structure will
originate in the nearest neighbor access pattern of the
PDE program (Figure 3.6). Data resulting from each
succeeding k-loop iteration is to be constructed as a
q-level I-structure to allow pipelining. If the placement
of the I-structures for each of these k-loop iterations
also meets the criterion above (except $i_0 = k$ instead of
$0$), then the computation can proceed nearly the same as
the PDE computation described in Section 2.3.5.

The first step in the complexity analysis of the PDE
program is to determine the time for program initiation.
Recall that T is the number of k-loop iterations required
for convergence. Then the time for the program to unfold
under the U-interpreter is $O(T + s_{q-1} + \cdots + s_0)$ since
the first PE will spawn T i-loops,20 each of these will
concurrently spawn $s_{q-1}$ j-loops, etc. Initialization of
the I-structures for all iterations can be done in time
linearly proportional to program initiation time since the
size of the vector to be initialized within the loop in
each case equals the number of subloops to be initiated.21
In addition, after each I-structure is initialized a
pointer referring to that I-structure is released in
constant time (Figure 3.5) and is returned as a value to


20 This analysis assumes the i-loop including its nested
loops may be initiated before all inputs to the loop are
available. Otherwise, initiations of these loops would
require initialization of the I-structure x from the
previous iteration of k. In theory, waiting for this
initialization tends to negate the advantage of
I-structures over structures. However, in practice the
initialization could be made very fast relative to other
operations in the machine. Such issues have been avoided
in the analysis by assuming that not all inputs are
required to initiate a loop.

21 In practice such uncontrolled initiation of k-loop
iterations and allocation of I-structures would probably
waste memory without improving performance over a policy
which delays initiation so that no more than one (or a
small number of) concurrent k-loop iteration(s) exist per
MCC hyperplane.
the I-structure within the next outer loop. Thus all
I-structures may be completely constructed except for the
elementary values at the bottom level in time linearly
proportional to program initiation time. The computation
may then proceed as described in Section 2.3.5 except that
q levels of structure x must be traversed to access each
element x[i,j,...,m]. The top level contains $s_{q-1}$
selectors and by means of a cache tree all PEs in MCC
hyperplane $(i_0 = k)$ may be sent the required pointer
values in

$O(s_{q-1} \log (s_{q-1} \cdots s_1))$

time since there are $s_{q-1} \cdots s_1$ PEs in that hyperplane,
which is also a $(\log (s_{q-1} \cdots s_1))$-cube by Corollary
2.2b. Similarly, all requests for pointers in the next
lower level can be serviced in

$O(s_{q-2} \log (s_{q-2} \cdots s_1))$

time within each of $s_{q-1}$ distinct
$(\log (s_{q-2} \cdots s_1))$-cubes (synonymous by Corollary 2.2b
with the $s_{q-1}$ MCC hyperplanes $(i_{q-1} = j, i_0 = k)$,
$0 \le j < s_{q-1}$), and so on for each structure level until
the level one removed from the bottom which requires

$O(s_1 \log s_1)$

time. By expanding the log term in each of these equations
to make them identical, the sum for all levels to satisfy
these pointer requests simplifies to

$O((s_{q-1} + \cdots + s_1) \log (s_{q-1} \cdots s_1))$.


Assume the values x[i,j,...,m] are concurrently requested
by all PEs for all k-loop iterations, followed by a request
for values x[i+1,j,...,m], etc. for a total of O(q) values
per PE.22 The time complexity of the overall PDE
computation is

$O(T + s_{q-1} + \cdots + s_0) + O(q (s_{q-1} + \cdots + s_1) \log (s_{q-1} \cdots s_1))$

plus the time to do the actual computation from Section
2.3.5,

$O(q (s_0 + T))$

for a total of

$O(q (s_{q-1} + \cdots + s_1) \log (s_{q-1} \cdots s_1) + q (s_0 + T))$.

If $N = s_{q-1} = s_{q-2} = \cdots = s_0$ then this equation reduces to
$O(q^3 N \log N + qT)$. In comparison, the complexity of the


22 Note that the caches would automatically tend to
eliminate the need for actual traversal of all q levels for
each of the 2q+1 values required to compute each result
value. In addition, the PDE program could be modified to
minimize redundant top level selects as was done in the
matrix multiply program. For simplicity, these options
were ignored in the analysis since they have little effect
on the overall order of time complexity.
same problem on a q-dimensional array computer where each
dimension is also of size N is $O(qT)$ while on a sequential
machine the complexity is $O(q N^q T)$.
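
The reduction to the case of equal dimensions can be checked term by term. The following is a sketch of that substitution, assuming $N \ge 2$ so that $\log N \ge 1$. It uses only the three terms stated in this section: program initiation, pointer traversal, and the computation from Section 2.3.5.

```latex
% Substituting N = s_{q-1} = ... = s_0 into each term:
\begin{align*}
s_{q-1} + \cdots + s_1 &= (q-1)N, \\
\log (s_{q-1} \cdots s_1) &= (q-1) \log N, \\
\text{initiation:} \quad O(T + s_{q-1} + \cdots + s_0) &= O(T + qN), \\
\text{pointer traversal:} \quad
  O\bigl(q (s_{q-1} + \cdots + s_1) \log (s_{q-1} \cdots s_1)\bigr)
  &= O\bigl(q (q-1)^2 N \log N\bigr) = O(q^3 N \log N), \\
\text{computation:} \quad O\bigl(q (s_0 + T)\bigr) &= O(qN + qT), \\
\text{total:} \quad O(T + qN) + O(q^3 N \log N) + O(qN + qT)
  &= O(q^3 N \log N + qT).
\end{align*}
```

The $qN$ and $T$ terms are absorbed because $qN \le q^3 N \log N$ for $q \ge 1$, $N \ge 2$, and $T \le qT$.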
4.0 EXPERIMENTAL RESULTS


Previous sections have analyzed network transfers,
programs, and mechanisms in terms of computational time
complexity. In this section analytic results on
I-structures are supported with evidence from executing
machine-compiled Id programs on the Irvine dataflow
simulator [22].


4.1 The Irvine Dataflow Simulator 


Although complete simulation of the architecture 
described in Section 3 would be desirable, so far this task 
has not been attempted. Instead the Irvine dataflow 
simulator was modified to independently test the utility of 
I-structures. Although the results are not directly 
applicable to the architecture described in this paper, 
complexity analyses indicate I-structures should reduce 
execution time of many programs on both architectures. 

The following is a brief description of the simulated
architecture; details may be found in [22]. The Irvine
dataflow simulator is a detailed deterministic simulation
of a particular interconnection of PEs. Some PEs called
memory controllers (MCs) are specialized to manipulate
structures and perform memory management. The
interconnection network is shown in Figure 4.1 where points
A and A' are connected together to form two
counter-rotating token bus rings. Each ring is partitioned
into as many slots as there are PEs and each slot is either
empty or holds one fixed-length token. Four PEs are
connected together and to a memory controller by a local
bus which carries structure access requests and responses.
Each memory controller is directly connected to a private,
conventional memory (M) organized as part of one unified
address space. MCs are connected together by a global bus
so every PE has indirect access to any data or code within
the machine. A group of four PEs connected by a local bus
to one MC is termed a physical domain. The collection of
PEs and MCs connected by the same counter-rotating token
bus is called a ring domain, which is the largest group now
simulated. Assignment functions (Section 3.1) are chosen
so closely connected activities (e.g., the activities
comprising an instance of a procedure or loop body) are
confined to the same physical domain. Since each token is
transmitted on the token bus which provides the shortest
distance path to its destination, such assignment functions
tend to reduce communication traffic between physical
domains, thereby promoting unimpeded local communication
within concurrently operating physical domains.
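
The choice between the two counter-rotating rings can be illustrated with a small Python sketch. The function name shortest_ring, the slot numbering 0..slots-1, the direction labels, and the tie-breaking rule (prefer clockwise) are all assumptions made here for illustration; the simulator's actual rule is documented in [22].

```python
def shortest_ring(src, dst, slots):
    """Return (direction, hops) for the ring giving the shorter path from src to dst.

    Slots are assumed to be numbered 0..slots-1 around the ring; one ring moves
    tokens toward higher slot numbers ("clockwise"), the other toward lower ones.
    """
    clockwise = (dst - src) % slots   # hops on the ring moving to higher slots
    counter = (src - dst) % slots     # hops on the ring moving to lower slots
    if clockwise <= counter:          # tie broken in favor of clockwise (assumption)
        return ("clockwise", clockwise)
    return ("counterclockwise", counter)
```

With two rings no token travels more than half way around, which is why confining related activities to one physical domain keeps most traffic off the rings altogether.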




4.2 Experiments: I-structures versus Structures 


For simplicity all communication conflicts are ignored
in the following complexity analysis intended to aid in
understanding the experimental results (previous complexity
results are not directly applicable to the simulated
architecture). Consider once again a q-dimensional
$s_{q-1} \times \cdots \times s_0$ PDE program (Figure 3.6). As discussed
in Section 3.5, if this program were implemented using
structures the structure y and all of its substructures
must be complete before the next k-loop iteration can
begin. Since each k-loop iteration is dependent only on
data from the previous k-loop iteration, recall that the q
nested loops within the k-loop may unfold under the
U-interpreter and thus the time required to complete each
k-loop iteration is $O(s_{q-1} + \cdots + s_0 + q)$. Therefore the
total time required by the structure program is
$O(T (s_{q-1} + \cdots + s_0 + q))$ where T is the total number of
k-loop iterations. For the one-dimensional planar
hydrodynamics code executed in the simulator23 this
equation reduces to $O(T s_0)$. For ease in understanding
experimental results, the convergence test was removed and
the number of k-loop

23 This code was donated to the University of California by
the Lawrence Livermore Laboratory. The code is a
declassified and simplified version of a program which
simulates shock wave interactions by solving large PDEs.
iterations was artificially set to $s_0$ giving a complexity
for the structure program of $O(s_0^2)$. By comparison, an
equivalent I-structure program requires
$O(T + s_0) = O(s_0 + s_0) = O(s_0)$ since iteration k+1 may
begin as soon as the first three values (in a
one-dimensional problem) have been computed in iteration k,
etc.
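
The contrast between the two schemas can be illustrated with a toy step-count model in Python. This is only a sketch under stated assumptions, not the simulator: each loop is assumed to initiate one iteration per time step, each interior value is assumed to take one step to compute once its three operands are present, and communication conflicts are ignored as in the analysis above. The function names structure_time and istructure_time are hypothetical.

```python
def structure_time(T, s):
    """Steps for T k-loop iterations over s points when each iteration must
    be complete before the next one may begin (structure schema)."""
    start = 0
    for _ in range(T):
        # Interior point m is initiated at step start + m and takes one step.
        start = max(start + m + 1 for m in range(1, s - 1))
    return start


def istructure_time(T, s):
    """Steps when iteration k is initiated at step k and each point waits
    only for its own three operands (I-structure schema)."""
    ready = [0] * s                      # initial data is present at step 0
    for k in range(1, T + 1):
        new = [0] * s
        new[0] = new[s - 1] = k          # boundary values supplied at initiation
        for m in range(1, s - 1):
            issued = k + m               # loop unfolding: one point per step
            operands = max(ready[m - 1], ready[m], ready[m + 1])
            new[m] = max(issued, operands) + 1
        ready = new
    return max(ready)
```

Under these assumptions structure_time(T, s) grows as T(s-1) while istructure_time(T, s) grows as T+s-1, mirroring the $O(T s_0)$ versus $O(T + s_0)$ results above.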

To test this analysis the compilation of structure
variables in Id loops was changed from the structure schema
(e.g., Figure 3.4) to the I-structure schema (e.g., Figure
3.5). The hydrodynamics source code was not modified in
any way. A series of experiments was then conducted by
varying the number of PEs in the machine for each problem
size $s_0$. The minimum execution time from these experiments
for each problem size was plotted in a complexity graph
showing the change in minimum execution time versus problem
size (Figure 4.2). This graph illustrates substantial
execution time reduction when I-structures are used even
for small problem sizes. Although the I-structure curve
appears almost linear, communication conflicts in this
architecture would eventually cause the curve to bend
upwards for larger problem sizes thus showing a real
complexity higher than $O(s_0)$. However the coefficients for
higher order terms in the actual complexity are small
enough so that these terms do not appear in the
experimental results when the simulator is in the "standard
configuration" [22].


For Gaussian elimination, a similar complexity analysis
(again ignoring communication conflicts) leads to $O(N^2)$
time to solve N equations with N unknowns using structures,
while the time for an I-structure program is $O(N)$. These
analyses are supported (although not as dramatically) by
the experimental results given in Figure 4.3. Note that
Gaussian elimination on a sequential machine requires
$O(N^3)$ time.
5.0 CONCLUSIONS


Our goal is to design a general-purpose computer
comprising large numbers of cooperating processors to
reduce execution time without increasing software costs.
Toward this goal we chose to base the design of the
architecture on dataflow since the traditional von Neumann
model appears little suited for large scale multiprocessing
[6,13,20,22].

New computer architectures are often justified by
their "cleverness quotient" or by listing concurrent
operations. The difficulty with such evaluations is either
extreme subjectivity or irrelevance since cleverness or
concurrency gains little when unforeseen bottlenecks
dominate execution time. Instead we suggest complexity
analysis as a tool appropriate for designing and evaluating 
multiprocessor architectures since complexity analysis 
quickly uncovers major bottlenecks. However, the 
complexity analyses in this paper have two shortcomings. 
First, large constants may be hidden in the O-notation used 
to simplify analysis; hence small computations may perform 
relatively poorly. Second, only a few numerical algorithms 
were analyzed and thus more work is needed to show that the 
architecture is indeed suitable for general-purpose 
computation. 

The asynchrony and decentralization present in the
proposed architecture make accurate analysis difficult and
thus salient features were abstracted to derive a parallel
computer model more amenable to analysis. This model is
representative of a class of models since the binary n-cube
on which it is based can be directly simulated by a number
of other networks. For example, Siegel has shown that each
of the n-1 sets of interconnections, $cube_i$, $0 < i < n$, of an
n-cube can be simulated with a "SIMD perfect shuffle"
network in n+1 steps while the set of interconnections,
$cube_0$, requires only one step [34]. Since the parallel
computer model allows simultaneous $cube_i$ transmissions in
one step (i.e., each transmitting PE may independently use
a different link i) and there are n-1 such $i \ne 0$, the
total for the SIMD perfect shuffle to simulate the parallel
computer network is $(n-1)(n+1)+1 = n^2$. Thus complexity
results derived for the parallel computer model are valid
for the SIMD perfect shuffle when multiplied by at most $n^2$.
In addition, the binary n-cube was found to have a number
of properties facilitating parallel complexity analysis
including regularity, concurrency potential, and ease of
creating partitions. However, the centralized clock of the
parallel computer model detracts from machine modularity
and programming flexibility. Section 3 of the paper showed
that the clock and other control could be decentralized
with minimal effect on the best case order of time
complexity where some of the mechanisms proposed to do this
included request/acknowledge transfers, weighted reference
counting, cache trees, and I-structures.
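
The step count quoted above for the SIMD perfect shuffle can be checked with a short Python sketch. The function names cube and shuffle_steps are hypothetical; cube follows the usual definition of the $cube_i$ interconnection (complement bit i of the PE address), and shuffle_steps only adds up the per-link step counts cited from [34] without modelling the shuffle network itself.

```python
def cube(i, address):
    """cube_i: the PE reached from `address` by complementing address bit i."""
    return address ^ (1 << i)


def shuffle_steps(n):
    """Steps for a SIMD perfect shuffle to emulate one step of the n-cube model:
    cube_0 costs one step and each of the other n-1 links costs n+1 steps."""
    return (n - 1) * (n + 1) + 1
```

For example, cube(2, 0b001) is 0b101, and shuffle_steps(n) equals n*n for every n, which is the factor of at most $n^2$ stated in the text.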
Figure 2.1
Examples of n-cubes 


3-cube,N=8 


4-cube,N=16 
Figure 2.2a
The four 1-cubes of one of the three (=3!/(1!2!))
possible partitions of a 3-cube are indicated by
double lines (n=3, m=1, k=4, partition={1}).


Figure 2.2b
One of the two 2-cubes sharing exactly one PE with
each of the 1-cubes in Figure 2.2a is shown with
double lines (complement of partition {1} =
partition {2,0}).
Figure 2.3a
Initial configuration of rows of matrix A in a 4x4
matrix multiplication.


Figure 2.3b
Result of broadcasting rows of A over 4th dimension
and front to back edges (i.e., the 2-cubes of partition
{2,3}). The numbers indicate row numbers of matrix A.

* Only one of eight 4th dimension connections is shown.


Figure 2.3c
Initial configuration of rows of matrix B and also
the resulting configuration of Bt.


Figure 2.3d
Result of broadcasting rows of Bt over front and back
faces (i.e., the 2-cubes of partition={0,1}). The
numbers indicate row numbers of matrix Bt.

* Only one of eight 4th dimension connections is shown.


Figure 2.3e
Combined results of A and Bt broadcasts with inner
products ready to be computed.

* Only one of eight 4th dimension connections is shown.


* Only one of eight 4th dimension connections is shown.


Figure 2.4 
A 4-cube is topologically equivalent to a 4x4 MCC with 
orthogonal wraparound. (All labels represent corresponding 


n-cube addresses; MCC labels are not shown.) 
Figure 2.5
A 4-cube is configured as a 2x8 MCC (unneeded connections
are ignored). A hyperplane (line) then bisects the MCC
into two 3-cubes where some of the connections are
recovered from those ignored during the original
configuration.
[Diagram labels: hyperplane (column) $(i_0 = 0)$ in which rows
of initial data are assigned; orthogonal wraparound
connection; one of the $s_1$ $O(s_0)$-stage "pipelines".]


Figure 2.6
Example layout of a 2-dimensional $s_1 \times s_0$ PDE problem


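[Diagram: A is a structure with selectors 1..N; each
selector refers to a row structure with selectors 1..N
whose values are the elements v11..v1N, ..., vN1..vNN.]


Figure 3.1
Structure representation of a NxN matrix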
procedure transpose (B, N)
  (trans : structure[1..N] ;            ! see note* !
   initial trans <- < >
   for i from 1 to N do
     new trans[i] <- (
       row : structure[1..N] ;
       initial row <- < >
       for j from 1 to N do
         new row[j] <- B[j,i]
       return row)
   return trans) ;


procedure multiply (A, Bt, N)
  (C : structure[1..N] ;
   initial C <- < >
   for i from 1 to N do
     row_A <- A[i] ;
     new C[i] <- (
       row_C : structure[1..N] ;
       initial row_C <- < >
       for j from 1 to N do
         col_B <- Bt[j] ;
         new row_C[j] <- (
           initial inner_prod <- 0
           for k from 1 to N do
             new inner_prod <- inner_prod + row_A[k]*col_B[k]
           return inner_prod)
       return row_C)
   return C)


Figure 3.2
The call "multiply(A, transpose(B,N), N)"
returns the product of NxN matrix A and NxN matrix B.


* This Pascal-like declaration is not part of Id and is for
illustrative purposes only.
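
For readers unfamiliar with Id, the computation expressed by Figure 3.2 can be sketched in Python. This is a sequential transliteration for illustration only: it uses 0-based lists where the Id program uses structures with selectors 1..N, and it does not model the concurrent unfolding of the loops on the dataflow machine.

```python
def transpose(B, N):
    """Return Bt, where Bt[i][j] == B[j][i] (0-based)."""
    return [[B[j][i] for j in range(N)] for i in range(N)]


def multiply(A, Bt, N):
    """Return the product of A and B, given A and the transpose Bt of B."""
    C = []
    for i in range(N):
        row_A = A[i]
        row_C = []
        for j in range(N):
            col_B = Bt[j]            # row j of Bt is column j of B
            inner_prod = 0
            for k in range(N):
                inner_prod = inner_prod + row_A[k] * col_B[k]
            row_C.append(inner_prod)
        C.append(row_C)
    return C
```

Passing the transpose means each inner product reads one row of A and one row of Bt, which is the access pattern the row broadcasts of Figures 2.3a through 2.3e are arranged to serve.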
procedure quicksort (A, N)
  (m <- N div 2 ;
   below, j, above <- (
     above, below : structure[1..N] ;   ! see note* !
     initial below <- < > ;
             above <- < > ;
             j <- 0 ;
             k <- 0
     for i from 1 to N do
       if i ≠ m
       then (if A[i] < A[m]
             then new below[j+1] <- A[i] ;
                  new j <- j+1
             else new above[k+1] <- A[i] ;
                  new k <- k+1)
     return (if j>1
             then quicksort (below, j)
             else below),
            j,
            (if k>1
             then quicksort (above, k)
             else above)) ;
   return (
     sorted : structure[1..N] ;         ! see note* !
     initial sorted <- append(below, j+1, A[m])
     for i from j+2 to N do
       new sorted[i] <- above[i-j-1]
     return sorted))


Figure 3.3
Id procedure to sort N element vector A


* This Pascal-like declaration is not part of Id and is for
illustrative purposes only.


(initial x <- < >
 for i from 1 to m do
   new x <- append(x,i,v)
 return x)


Figure 3.4
Example compilation of an Id loop expression
(initial x <- < >
 for i from 1 to m do
   new x <- append(x,i,v)
 return x)

[Diagram labels: I-structure pointer generator, emitting a
read-only I-structure pointer x and an updateable
I-structure pointer x; return x.]


Figure 3.5
Compilation of the same program fragment as in Figure
3.4 but treating x as an I-structure rather than a
structure
(initial x <- x_0   ! x_0 contains initial PDE data
                      represented by a q-level structure !
 for k from 1 until (convergence is achieved) do
   new x <- (
     y : I-structure[0..s_{q-1}-1] ;          ! see note* !
     initial y <- <0:x[0], (s_{q-1}-1):x[s_{q-1}-1]>
     for i from 1 to s_{q-1}-2 do
       new y[i] <- (
         y_sub_plane : I-structure[0..s_{q-2}-1] ;
         initial y_sub_plane <- <0:x[i,0], (s_{q-2}-1):x[i,s_{q-2}-1]>
         for j from 1 to s_{q-2}-2 do
           new y_sub_plane[j] <- (
             row : I-structure[0..s_0-1] ;
             initial row <- <0:x[i,j,...,0], (s_0-1):x[i,j,...,s_0-1]>
             for m from 1 to s_0-2 do
               new row[m] <- x[i,j,...,m]/2 +
                 (x[i+1,j,...,m] + x[i-1,j,...,m] +
                  x[i,j+1,...,m] + x[i,j-1,...,m] +
                  x[i,j,...,m+1] + x[i,j,...,m-1])/(4*q)
             return row)
         return y_sub_plane)
     return y)
 return x)


Figure 3.6
Sample Id program to solve for
variable x (e.g., temperature) of a PDE


* This Pascal-like declaration is not part of Id and is for
illustrative purposes only.
Figure 3.7
(q=3)-dimensional PDE initial data structure $x_0$ layout
is shown for MCC plane $(i_0 = 0)$. Top level of $x_0$ may
reside in any PE (e.g., rightmost PE above) in MCC plane
$(i_0 = 0)$, $x_0[j]$ may reside in any PE (e.g., top PE above)
in MCC column $(i_2 = j, i_0 = 0)$, and data vector $x_0[j,k]$
must reside in PE(j,k,0).


[Diagram labels: physical domain d, physical domain d+1;
token buses; global bus.]


Figure 4.1
The interconnection of processors (PEs), memory
controllers (MCs), and memories (Ms) in the Irvine
dataflow simulator.
[Plot: horizontal axis "Problem Size", ticks 2 through 7.]


Figure 4.2
Execution time complexity curves for a one-dimensional
planar hydrodynamics simulation. The curves show that
I-structures reduce execution time through increased
parallelism. The number of PEs used to achieve the
minimum execution time appears adjacent to each point.
Figure 4.3
Execution time complexity curves for Gaussian elimination.
The number of PEs used to achieve the minimum execution
time appears adjacent to each point.